
Add benchmark metrics persistence #2339

Merged
ieaves merged 7 commits into containers:main from ramalama-labs:metrics
Jan 25, 2026

Conversation


@ieaves ieaves commented Jan 22, 2026

Summary by Sourcery

Introduce persistent benchmark result storage and a new CLI for viewing historical benchmarks, while enriching bench output formatting and configuration support.

New Features:

  • Add a benchmarks CLI command with subcommands (currently list) to view stored benchmark results in table or JSON formats with pagination.
  • Persist bench command outputs as structured benchmark records, capturing device, configuration, and result metadata for later inspection.
  • Add configuration support for benchmarks, including a configurable storage folder with sensible defaults.

Enhancements:

  • Extend the bench command to support selectable output formats (table or JSON) and render richer tabular benchmark summaries including model parameters and throughput.
  • Refine configuration file discovery via a cached helper and wire benchmark settings into the global config.
  • Adjust CLI help/tests and llama.cpp benchmark engine spec so llama-bench emits JSON suitable for structured parsing.

Documentation:

  • Document the new benchmarks configuration section and add a dedicated ramalama-benchmarks(1) man page, while updating existing man pages and examples to reference the new commands and options.

Tests:

  • Add unit tests for the benchmarks manager and config documentation expectations, and update existing bench-related e2e/system tests for the new table headers and help behavior.


sourcery-ai bot commented Jan 22, 2026

Reviewer's Guide

Adds persistent benchmark metrics collection, storage, and querying to RamaLama, including a new benchmarks CLI, schema types for benchmark records, utilities for parsing/printing results, and configuration support for benchmark storage.

File-Level Changes

Change | Details | Files
Extend bench CLI to support output format selection and normalize subcommand alias handling.
  • Add --format {table,json} option to ramalama bench with default table and document it in ramalama-bench.1.md.
  • Normalize the benchmark alias back to bench in post-parse setup so downstream logic only sees bench (see the sketch after this change's file list).
  • Adjust tests that assert bench help/usage so they account for the new option and updated column name expectations.
ramalama/cli.py
ramalama/transports/base.py
docs/ramalama-bench.1.md
test/e2e/test_bench.py
test/system/002-bench.bats
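
A minimal, self-contained sketch of the --format option plus alias-normalization pattern described above. It is illustrative only; the real wiring in ramalama/cli.py differs in structure.

```python
# Illustrative only: argparse leaves the typed alias in the namespace, so a
# post-parse step maps "benchmark" back to "bench" for downstream code.
import argparse

parser = argparse.ArgumentParser(prog="ramalama")
sub = parser.add_subparsers(dest="subcommand")
bench = sub.add_parser("bench", aliases=["benchmark"])
bench.add_argument("--format", choices=["table", "json"], default="table")

args = parser.parse_args(["benchmark", "--format", "json"])
if args.subcommand == "benchmark":
    args.subcommand = "bench"
print(args.subcommand, args.format)  # -> bench json
```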
Introduce a persistent benchmarks subsystem with schemas, storage manager, and utilities for parsing and printing run results.
  • Define dataclasses for device info, test configuration, llama-bench results, and benchmark records, plus factory helpers to create versioned objects.
  • Implement JSON/JSONL parsing utilities and a tabular printer that formats benchmark rows (model, params, backend, ngl, threads, test, t/s, etc.).
  • Implement BenchmarksManager to append benchmark records as JSONL in a configurable storage folder and to list all stored benchmarks (a minimal sketch follows this change's file list).
  • Add a specific MissingStorageFolderError (and alias MissingDBPathError) to signal misconfiguration of the benchmarks storage path.
ramalama/benchmarks/schemas.py
ramalama/benchmarks/utilities.py
ramalama/benchmarks/manager.py
ramalama/benchmarks/errors.py
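
A minimal sketch of what a JSONL-backed manager of this shape can look like. The class, file, and error names follow the description above; the record fields and method signatures are simplified placeholders, not the PR's actual API.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class BenchmarkRecordV1:
    # Trimmed stand-in for the real schema (device, configuration, result, ...).
    model: str = ""
    tokens_per_second: float = 0.0
    version: str = "v1"


class MissingStorageFolderError(Exception):
    """Raised when no benchmarks storage folder is configured."""


class BenchmarksManager:
    def __init__(self, storage_folder: str | None):
        if not storage_folder:
            raise MissingStorageFolderError("benchmarks storage folder is not configured")
        self.path = Path(storage_folder) / "benchmarks.jsonl"

    def append(self, record: BenchmarkRecordV1) -> None:
        # JSONL: one JSON object per line, appended so earlier runs are never rewritten.
        self.path.parent.mkdir(parents=True, exist_ok=True)
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(asdict(record)) + "\n")

    def list_all(self) -> list[BenchmarkRecordV1]:
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as fh:
            return [BenchmarkRecordV1(**json.loads(line)) for line in fh if line.strip()]
```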
Wire benchmark persistence into transport bench execution and add a CLI to inspect historical results.
  • Change transport bench to always request JSON from llama-bench, parse the JSON output, build BenchmarkRecordV1 instances, then either print as JSON or formatted table depending on --format (sketched after this change's file list).
  • On successful bench runs, conditionally save results via BenchmarksManager unless CONFIG.benchmarks.disable is true.
  • Add a new top-level benchmarks subcommand with a list subcommand supporting --limit, --offset, and --format {table,json}, and error handling for missing storage folders.
  • Exclude benchmarks from the generic help-invalid-arg check since it has subcommands and document the new command in the main ramalama(1) manpage and its own ramalama-benchmarks(1) page.
ramalama/cli.py
ramalama/transports/base.py
inference-spec/engines/llama.cpp.yaml
docs/ramalama.1.md
docs/ramalama-benchmarks.1.md
test/system/015-help.bats
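
A rough, self-contained sketch of the "parse JSON, then branch on --format" flow described above. The JSON key names (avg_ts, n_gpu_layers) are assumptions about llama-bench's output and the record type is a placeholder; only the overall shape mirrors the PR. Persisting the rows would then be one BenchmarksManager append per record, skipped when the benchmarks disable flag is set.

```python
import json
import subprocess
from dataclasses import asdict, dataclass


@dataclass
class BenchRow:
    model: str
    n_gpu_layers: int
    tokens_per_second: float


def bench(cmd: list[str], fmt: str = "table") -> list[BenchRow]:
    # The engine spec asks llama-bench for JSON so results can be parsed structurally.
    proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
    rows = [
        BenchRow(r.get("model", ""), r.get("n_gpu_layers", 0), r.get("avg_ts", 0.0))
        for r in json.loads(proc.stdout)
    ]
    if fmt == "json":
        # A concrete list of dicts, not a generator, so json.dumps can serialize it.
        print(json.dumps([asdict(r) for r in rows], indent=2, sort_keys=True))
    else:
        for r in rows:
            print(f"{r.model:<40} ngl={r.n_gpu_layers:<4} {r.tokens_per_second:8.2f} t/s")
    return rows
```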
Extend configuration to support benchmark-related settings and expose config file discovery paths.
  • Introduce a Benchmarks config dataclass with storage_folder (defaulting under the first existing default config dir) and disable flag, and add it as CONFIG.benchmarks (a simplified sketch follows this change's file list).
  • Add get_default_benchmarks_storage_folder and get_config_file_path helpers to centralize configuration directory/file resolution.
  • Adjust load_file_config to track the list of parsed config files and expose them under settings.config_files.
  • Update unit tests to require documentation of the new benchmarks config section and tweak error messages to inline lists in backticks.
  • Document the ramalama.benchmarks table in docs/ramalama.conf and docs/ramalama.conf.5.md, including a configurable db path example.
ramalama/config.py
test/unit/test_config_documentation.py
docs/ramalama.conf
docs/ramalama.conf.5.md
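
A simplified sketch of the configuration surface described above. The dataclass and helper names echo the reviewer's guide, but the default path shown here is an assumption; the PR derives it from RamaLama's existing config-directory resolution.

```python
from dataclasses import dataclass, field
from pathlib import Path


def get_default_benchmarks_storage_folder() -> Path:
    # Placeholder default; the real helper walks the default config/store dirs.
    return Path.home() / ".local" / "share" / "ramalama" / "benchmarks"


@dataclass
class Benchmarks:
    storage_folder: Path = field(default_factory=get_default_benchmarks_storage_folder)
    disable: bool = False


@dataclass
class Config:
    benchmarks: Benchmarks = field(default_factory=Benchmarks)


CONFIG = Config()
print(CONFIG.benchmarks.storage_folder, CONFIG.benchmarks.disable)
```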
Refine bench output semantics and minor code cleanup.
  • Change displayed bench table column header from size to params and update regex-based tests accordingly.
  • Update llama.cpp bench command spec to explicitly pass JSON output and common performance-relevant flags instead of using the previous shared options anchor.
  • Perform small type and style cleanups (dict type hint modernization, remove stray blank line in CommandFactory, improve documentation error messages).
test/e2e/test_bench.py
test/system/002-bench.bats
inference-spec/engines/llama.cpp.yaml
ramalama/transports/base.py
ramalama/command/factory.py
test/unit/test_config_documentation.py


@gemini-code-assist
Contributor

Summary of Changes

Hello @ieaves, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the benchmarking capabilities of the ramalama CLI by introducing a dedicated command for managing historical benchmark results. It provides a structured way to store, retrieve, and display performance metrics, allowing users to track and analyze model performance over time. The changes integrate seamlessly with existing benchmarking workflows and offer flexible output options.

Highlights

  • New 'benchmarks' command: Introduced a new ramalama benchmarks command to view and interact with historical benchmark results, stored in a local SQLite database.
  • Enhanced 'bench' command: The ramalama bench command now supports a --format option (table or json) for output and automatically saves benchmark results to the new metrics storage.
  • Structured Benchmark Data: New Python modules (errors.py, manager.py, schemas.py, utilities.py) have been added under ramalama/benchmarks to define data structures for benchmark records, device information, and test configurations, and to manage their storage and retrieval.
  • Configuration for Benchmarks: A new [ramalama.benchmarks] section has been added to ramalama.conf and its man page, allowing users to configure the db_path for the benchmark results database.
  • Llama.cpp Integration: The llama.cpp inference engine configuration has been updated to output benchmark results in JSON format, facilitating structured data capture.


@ieaves ieaves changed the title from "Metrics" to "Add benchmark metrics persistence" on Jan 22, 2026

@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 1 security issue, 5 other issues, and left some high level feedback:

Security issues:

  • Detected subprocess function 'CompletedProcess' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'. (link)

General comments:

  • In `benchmarks_list_cli`, when `args.format == 'json'` you pass a generator (`(asdict(item) for item in results)`) to `json.dumps`, which will fail serialization; convert to a list (e.g., `[asdict(item) for item in results]`) before dumping.
  • In `normalize_benchmark_record`, the error message uses `type(BenchmarkRecord)` instead of the actual instance type; update it to `type(benchmark)` so the raised `NotImplementedError` reports the correct offending type.
  • The `ramalama-benchmarks` man page describes results being stored in a SQLite database with a `db_path`, but `BenchmarksManager` currently writes JSONL to `benchmarks.jsonl`; consider aligning the implementation with the documented SQLite behavior or adjusting the configuration naming to avoid confusion.
Individual Comments

### Comment 1
<location> `ramalama/cli.py:572-574` </location>
<code_context>
+            print("No benchmark results found")
+            return
+
+        if args.format == "json":
+            output = (asdict(item) for item in results)
+            print(json.dumps(output, indent=2, sort_keys=True))
+        else:
+            print_bench_results(results)
</code_context>

<issue_to_address>
**issue (bug_risk):** JSON output path builds a generator, which `json.dumps` cannot serialize.

`output` is a generator (`(asdict(item) for item in results)`), which `json.dumps` cannot serialize and will raise a `TypeError`. Please convert to a concrete structure first, e.g. `output = [asdict(item) for item in results]`, or switch to a streaming JSON approach.
</issue_to_address>
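
A tiny stand-alone repro of the failure mode and the suggested fix (the Row dataclass is a placeholder):

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Row:
    model: str = "m"


results = [Row(), Row()]

# json.dumps((asdict(r) for r in results))  # TypeError: Object of type generator is not JSON serializable
print(json.dumps([asdict(r) for r in results], indent=2, sort_keys=True))  # works
```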

### Comment 2
<location> `ramalama/benchmarks/schemas.py:114` </location>
<code_context>
+    configuration: TestConfigurationV1
+    result: LlamaBenchResultV1
+    version: Literal["v1"] = "v1"
+    created_at: str = datetime.now(timezone.utc).isoformat()
+    device: DeviceInfoV1 = field(default_factory=DeviceInfoV1.current_device_info)
+
</code_context>

<issue_to_address>
**issue (bug_risk):** `created_at` default is evaluated at import time, so all records share the same timestamp.

Because this default is evaluated at class definition time, every `BenchmarkRecordV1` created without an explicit `created_at` will have the same timestamp. Use a `default_factory`, e.g. `field(default_factory=lambda: datetime.now(timezone.utc).isoformat())`, to generate a fresh value per instance.
</issue_to_address>
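
The difference is easy to demonstrate in isolation; these two toy dataclasses are placeholders, not the PR's schema:

```python
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class SharedTimestamp:
    # Evaluated once, when the class body runs: every instance gets the same value.
    created_at: str = datetime.now(timezone.utc).isoformat()


@dataclass
class FreshTimestamp:
    # Evaluated per instance, as the review suggests.
    created_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())


a = SharedTimestamp()
time.sleep(0.01)
b = SharedTimestamp()
assert a.created_at == b.created_at  # same stale timestamp

c = FreshTimestamp()
time.sleep(0.01)
d = FreshTimestamp()
assert c.created_at != d.created_at  # fresh timestamp per record
```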

### Comment 3
<location> `ramalama/benchmarks/schemas.py:202-206` </location>
<code_context>
+    raise NotImplementedError(f"No supported benchmark schemas for version {version}")
+
+
+def normalize_benchmark_record(benchmark: BenchmarkRecord) -> BenchmarkRecordV1:
+    if isinstance(benchmark, BenchmarkRecordV1):
+        return benchmark
+
+    raise NotImplementedError(f"Received an unsupported benchmark record type {type(BenchmarkRecord)}")
</code_context>

<issue_to_address>
**issue (bug_risk):** Error message uses `type(BenchmarkRecord)` instead of the actual `benchmark` instance.

This will always report the base class, not the actual runtime type, which makes the error misleading. Using `type(benchmark)` would correctly show the unexpected concrete type received.
</issue_to_address>

### Comment 4
<location> `ramalama/benchmarks/schemas.py:40-47` </location>
<code_context>
+    container_runtime: str = ""
+    inference_engine: str = ""
+    version: Literal["v1"] = "v1"
+    runtime_args: dict[str, Any] | None = None
+
+
</code_context>

<issue_to_address>
**suggestion:** The declared type of `runtime_args` does not match how it is populated.

In `BaseTransport.bench`, `runtime_args` receives `cmd` (a `list[str]`), but it’s annotated as `dict[str, Any] | None`. Please align the annotation with actual usage (e.g., `list[str] | None` or `object`) or change the caller to pass a mapping instead of a list.

```suggestion
@dataclass
class TestConfigurationV1(TestConfiguration):
    """Container configuration metadata for a benchmark run."""

    container_image: str = ""
    container_runtime: str = ""
    inference_engine: str = ""
    version: Literal["v1"] = "v1"
    runtime_args: list[str] | None = None
```
</issue_to_address>

### Comment 5
<location> `ramalama/config.py:153-154` </location>
<code_context>
+    version: ClassVar[Any]
+
+
+@dataclass
+class DeviceInfoV1(DeviceInfo):
+    hostname: str
</code_context>

<issue_to_address>
**issue (bug_risk):** Config field name `storage_folder` conflicts with documented `db_path` key.

Unless there’s explicit aliasing between `db_path` and `storage_folder`, values set in the config file will be ignored and the default will always be used. Please either rename the field to match the documented key or introduce a compatibility mapping so existing configs continue to work as expected.
</issue_to_address>
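
One possible compatibility shim, purely as an assumption about how the mismatch could be bridged (not what the PR does): accept the documented db_path key and map it onto the storage_folder field when the [ramalama.benchmarks] table is loaded.

```python
def load_benchmarks_table(table: dict) -> dict:
    # Map the documented key onto the dataclass field name, preferring an
    # explicit storage_folder if both happen to be present.
    table = dict(table)
    if "db_path" in table and "storage_folder" not in table:
        table["storage_folder"] = table.pop("db_path")
    return table


print(load_benchmarks_table({"db_path": "/var/lib/ramalama/benchmarks"}))
# {'storage_folder': '/var/lib/ramalama/benchmarks'}
```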

### Comment 6
<location> `ramalama/transports/base.py:472` </location>
<code_context>
            result = subprocess.CompletedProcess(args=escaped_cmd, returncode=0, stdout="", stderr="")
</code_context>

<issue_to_address>
**security (python.lang.security.audit.dangerous-subprocess-use-audit):** Detected subprocess function 'CompletedProcess' without a static string. If this data can be controlled by a malicious actor, it may be an instance of command injection. Audit the use of this call to ensure it is not controllable by an external resource. You may consider using 'shlex.escape()'.

*Source: opengrep*
</issue_to_address>
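
For reference, the stdlib helper for safely rendering a dynamic argv as a shell string is shlex.quote (the shlex.escape named in the tool message does not exist); constructing CompletedProcess from a Python list, as here, never passes through a shell at all.

```python
import shlex
import subprocess

cmd = ["llama-bench", "-m", "model name with spaces.gguf", "-o", "json"]
# A list argv is passed around as data, not interpreted by a shell.
result = subprocess.CompletedProcess(args=cmd, returncode=0, stdout="", stderr="")
# If the command ever needs to be displayed or logged as a shell string:
print(" ".join(shlex.quote(part) for part in cmd))
```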



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant new feature for persisting and viewing benchmark results. It adds a new benchmarks CLI command, enhances the bench command with structured JSON output, and includes configuration and documentation for these changes. The implementation is comprehensive, covering data schemas, storage management, CLI integration, and testing.

My review focuses on ensuring correctness, maintainability, and consistency between the code and its documentation. I've identified a few issues, including documentation inaccuracies regarding the storage mechanism (JSONL vs. SQLite), a potential bug with a mutable default in a dataclass, and an issue with JSON serialization of a generator. Addressing these points will improve the robustness and usability of this new feature.


ieaves commented Jan 22, 2026

@olliewalsh something went wonky with #2237 and I had to open a new PR. Sorry! This contains the suggested documentation changes and the refactor from SQLite to JSONL.


def get_default_benchmarks_storage_folder() -> Path:
    conf_dir = None
    for dir in DEFAULT_CONFIG_DIRS:
Collaborator


This should use the store dir; the config dir may not be writable.

Collaborator Author


👍. Ideally we'd be able to default the benchmarks storage directory to the user-specified store path in this case. Unfortunately, is_set doesn't currently support checking nested subfields, so I put it off and instead had it use the default store path as the default benchmarks base path.

The config is getting pretty complicated at this point. Do you think it'd be worth bringing something like pydantic in and managing this through the likes of BaseSettings?
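
Not part of this PR, but a rough illustration of the pydantic-settings idea floated here: nested benchmark settings with environment-variable overrides. It assumes the third-party pydantic-settings package, and the names and env-var scheme are placeholders.

```python
from pathlib import Path

from pydantic import BaseModel, Field
from pydantic_settings import BaseSettings, SettingsConfigDict


class BenchmarksSettings(BaseModel):
    storage_folder: Path | None = None
    disable: bool = False


class RamalamaSettings(BaseSettings):
    model_config = SettingsConfigDict(env_prefix="RAMALAMA_", env_nested_delimiter="__")

    benchmarks: BenchmarksSettings = Field(default_factory=BenchmarksSettings)


# e.g. RAMALAMA_BENCHMARKS__STORAGE_FOLDER=/tmp/benchmarks overrides the nested default
settings = RamalamaSettings()
print(settings.benchmarks)
```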


ieaves commented Jan 24, 2026

@olliewalsh This should be good to go if you're ready to merge.

@ieaves ieaves temporarily deployed to macos-installer January 24, 2026 22:15 — with GitHub Actions Inactive
ramalama/cli.py Outdated

except MissingStorageFolderError:
    print("Error: RAMALAMA__BENCHMARKS_STORAGE_FOLDER not configured")
    sys.exit(1)
Collaborator


Better to raise an exception and let main() handle the exit code

Collaborator Author


Might make sense to pull that code out of the try/except altogether in that case.

else:
    dry_run(cmd)

result = subprocess.CompletedProcess(args=cmd, returncode=0, stdout="", stderr="")
Collaborator


Is this necessary? Everything else just returns None

Collaborator Author

@ieaves ieaves Jan 25, 2026


It's not necessary; it just keeps the typing consistent without the early return. I'll switch to an early return instead, though, since it's cleaner.

Signed-off-by: Ian Eaves <ian.k.eaves@gmail.com>
…ords

Signed-off-by: Ian Eaves <ian.k.eaves@gmail.com>
Signed-off-by: Ian Eaves <ian.k.eaves@gmail.com>
Signed-off-by: Ian Eaves <ian.k.eaves@gmail.com>
Signed-off-by: Ian Eaves <ian.k.eaves@gmail.com>
@ieaves ieaves temporarily deployed to macos-installer January 25, 2026 04:18 — with GitHub Actions Inactive
Signed-off-by: Ian Eaves <ian.k.eaves@gmail.com>
@ieaves ieaves temporarily deployed to macos-installer January 25, 2026 04:54 — with GitHub Actions Inactive
Signed-off-by: Ian Eaves <ian.k.eaves@gmail.com>
@ieaves ieaves temporarily deployed to macos-installer January 25, 2026 07:55 — with GitHub Actions Inactive
@ieaves ieaves merged commit 739563a into containers:main Jan 25, 2026
40 checks passed